Chapter 19 - Exercises¶

In [1]:
library(tidyverse)
library(bayesrules)
library(bayesplot)
library(rstan)
library(rstanarm)
library(broom.mixed)
library(tidybayes)
library(forcats)
options(mc.cores = parallel::detectCores())
(package startup messages, abridged)
tidyverse 2.0.0 attached; dplyr::filter() and dplyr::lag() mask their stats counterparts.
This is bayesplot version 1.10.0; theme set to bayesplot::theme_default().
rstan (Version 2.21.8) loaded; it recommends options(mc.cores = parallel::detectCores()) and rstan_options(auto_write = TRUE), and masks tidyr::extract().
This is rstanarm version 2.21.4; default priors may change, so it's safest to specify priors, even if equivalent to the defaults. rstanarm masks rstan::loo().

Exercise 19.1¶

Is there a book level? That is, are the characteristics of a book always the same, regardless of state and challenge?

Note that there are sometimes multiple challenges per state:

In [2]:
book_banning %>% filter( title  %in% "A Separate Reality" )
A data.frame: 2 × 17

| title | book_id | author | date | year | removed | explicit | antifamily | occult | language | lgbtq | violent | state | political_value_index | median_income | hs_grad_rate | college_grad_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A Separate Reality | 46 | Castaneda, Carlos | 2001-05-25 | 2001 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | OR | 4 | 1274.5 | 5.538042 | 1.07627 |
| A Separate Reality | 46 | Casteneda, Carlos | 2000-06-29 | 2000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | OR | 4 | 1274.5 | 5.538042 | 1.07627 |
In [3]:
book_banning %>% filter( title  %in% "Always Running" )
A data.frame: 3 × 17

| title | book_id | author | date | year | removed | explicit | antifamily | occult | language | lgbtq | violent | state | political_value_index | median_income | hs_grad_rate | college_grad_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Always Running | 118 | Rodriguez, Luis | 2004-09-01 | 2004 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | CA | 7.4 | 10119 | -2.761958 | 2.57627 |
| Always Running | 118 | Rodriguez, Luis | 2004-12-05 | 2004 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | CA | 7.4 | 10119 | -2.761958 | 2.57627 |
| Always Running | 118 | Rodriguez, Luis | 2004-10-15 | 2004 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | CA | 7.4 | 10119 | -2.761958 | 2.57627 |

Are book IDs mapped to titles uniquely? In other words, can multiple book IDs be assigned to the same title, or multiple titles to the same book ID?

There is only one book ID for a given title.

In [4]:
book_banning %>% count(book_id, title) %>% count( title ) %>% filter(n > 1)
A data.frame: 0 × 2 (no rows; columns: title <chr>, n <int>)

However, there are multiple titles assigned to the same book ID:

In [5]:
book_banning %>% count(book_id, title) %>% count( book_id ) %>% filter(n > 1) %>% head()
A data.frame: 6 × 2

| book_id | n |
|---|---|
| 100 | 2 |
| 148 | 2 |
| 356 | 3 |
| 717 | 2 |
| 722 | 2 |
| 724 | 2 |

Some examples:

In [6]:
book_banning  %>% filter( book_id==100 ) %>% select( title, book_id, author, date )
A data.frame: 2 × 4

| title | book_id | author | date |
|---|---|---|---|
| Alice on Her Way | 100 | Naylor, Phyllis Reynolds | 2007-09-08 |
| Alice On Her Way | 100 | Naylor, Phyllis Reynolds | 2009-11-09 |
In [7]:
book_banning  %>% filter( book_id==148 ) %>% select( title, book_id, author, date )
A data.frame: 6 × 4

| title | book_id | author | date |
|---|---|---|---|
| Angus, Thongs, and Full Frontal Snogging | 148 | Rennison, Louise | 2005-04-01 |
| Angus, Thongs, and Full Frontal Snogging | 148 | Rennison, Louise | 2003-12-11 |
| Knocked Out By My Nunga-Nungas | 148 | Rennison, Louise | 2007-07-22 |
| Angus, Thongs, and Full Frontal Snogging | 148 | Rennison, Louise | 2004-04-09 |
| Angus, Thongs, and Full Frontal Snogging | 148 | Rennison, Louise | 2009-11-22 |
| Angus, Thongs, and Full Frontal Snogging | 148 | Rennison, Louise | 2006-07-15 |
In [8]:
book_banning  %>% filter( book_id==356 ) %>% select( title, book_id, author, date )
A data.frame: 3 × 4

| title | book_id | author | date |
|---|---|---|---|
| Captain Underpants | 356 | Pilkey, Dav | 2010-08-12 |
| Captain Underpants (series) | 356 | Pilkey, Dav | 2005-06-09 |
| Captain Underpants and the Perilous Plot of Professor Poopypants | 356 | Pilkey, Dav | 2002-06-24 |

The titles differ slightly (capitalization, series entries), but appear to refer to more or less the same book for a given book ID; this would need closer investigation.
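One quick way to probe this is a sketch that normalizes case and whitespace (using str_to_lower() and str_squish() from stringr, which is loaded with the tidyverse) and counts how many book IDs still carry more than one distinct normalized title:

```r
# Sketch: normalize titles (case and whitespace) and count how many
# book IDs still map to more than one distinct normalized title.
book_banning %>% 
    mutate(title_norm = str_to_lower(str_squish(title))) %>% 
    distinct(book_id, title_norm) %>% 
    count(book_id) %>% 
    filter(n > 1)
```

Genuinely different titles, such as the "(series)" entry for book ID 356, would of course survive this normalization and still need a manual look.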

Are the book characteristics explicit, antifamily, occult, language, lgbtq and violent the same for all challenges of a book in a given state? To check this, compute the mean of these 0/1 variables grouped by state and book ID: if a mean differs from 0 and 1, the values vary within that book. No such rows turn up, so we can speak of a book level that is a sublevel of the state level.

In [9]:
book_banning %>% 
    select( state, book_id, explicit:violent ) %>% 
    group_by( state, book_id ) %>% 
    summarize_all( \(x) { mean(as.numeric(x)-1) } ) %>% 
    filter_at( vars(explicit:violent), \(x) { !(x  %in%  c(1,0))} )
A grouped_df: 0 × 8 (no rows; columns: state <chr>, book_id <fct>, explicit, antifamily, occult, language, lgbtq, violent <dbl>)

It remains to check whether the state-level predictors really are constant within each state. To that end, check whether the variance of any of these predictors within a state is larger than zero:

In [10]:
book_banning %>% 
    select( state, political_value_index, median_income, hs_grad_rate, college_grad_rate ) %>% 
    group_by( state ) %>% 
    summarize_all( var ) %>% 
    filter_at( vars(political_value_index:college_grad_rate), \(x) {x>0})
A tibble: 0 × 5 (no rows; columns: state <chr>, political_value_index, median_income, hs_grad_rate, college_grad_rate <dbl>)

Exercise 19.2¶

Check that the peak-level predictors height_metres and first_ascent_year do not vary within a peak:

In [11]:
climbers_sub %>% 
    select( peak_id, height_metres, first_ascent_year ) %>% 
    group_by( peak_id ) %>% 
    summarize_all( var ) %>% 
    filter_at( vars(height_metres:first_ascent_year), \(x) !(x %in% c(0,NA)) )
A tibble: 0 × 3 (no rows; columns: peak_id <fct>, height_metres, first_ascent_year <dbl>)

Check that the expedition-level predictor count does not vary within an expedition (a sublevel of the peak level):

In [12]:
climbers_sub %>% 
    group_by( peak_id, expedition_id ) %>% 
    summarize( var_count = var(count) ) %>% 
    filter( var_count > 0 )
`summarise()` has grouped output by 'peak_id'. You can override using the
`.groups` argument.
A grouped_df: 0 × 3 (no rows; columns: peak_id <fct>, expedition_id <chr>, var_count <dbl>)

Check that each climber appears only once (climbers are a sublevel of the expedition level), so that climber-level predictors cannot vary within a climber:

In [13]:
climbers_sub %>% 
    select( peak_id, expedition_id, member_id, age, expedition_role ) %>% 
    group_by( peak_id, expedition_id, member_id ) %>% 
    summarize( count=n() ) %>% 
    filter( count>1 )
`summarise()` has grouped output by 'peak_id', 'expedition_id'. You can
override using the `.groups` argument.
A grouped_df: 0 × 4 (no rows; columns: peak_id <fct>, expedition_id <chr>, member_id <fct>, count <int>)

No multiple entries for the same climber, so no variance.

Exercise 19.5¶

In [14]:
spotify_small <- spotify %>% 
      filter(artist %in% c("Beyoncé", "Camila Cabello", "Camilo",
                           "Frank Ocean", "J. Cole", "Kendrick Lamar")) %>% 
      select(artist, album_id, popularity, valence)

head( spotify_small )
A tibble: 6 × 4

| artist | album_id | popularity | valence |
|---|---|---|---|
| Beyoncé | 7dK54iZuOxXFarGhXwEXfF | 75 | 55.2 |
| Beyoncé | 39P7VD7qlg3Z0ltq60eHp7 | 77 | 47.2 |
| Beyoncé | 1gIC63gC3B7o7FfpPACZQJ | 75 | 76.0 |
| Beyoncé | 1gIC63gC3B7o7FfpPACZQJ | 75 | 76.0 |
| Beyoncé | 20E3PwDg1jaDdK9K565luD | 66 | 47.2 |
| Beyoncé | 0Zd10MKN5j9KwUST0TdBBB | 72 | 50.9 |
In [15]:
dim( spotify_small )
155 4

a)¶

album_id is the second grouping variable. Usually, an artist produces multiple albums with multiple songs each, so we are dealing with a nested model. Multiple artists can collaborate on an album, but such cases appear to be handled unambiguously by listing all collaborators in the artist field.
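A quick sanity check on this assumption, in the spirit of the grouping checks from Exercise 19.1 (a sketch on the full spotify data from bayesrules):

```r
# Sketch: check that each album_id maps to exactly one artist value,
# i.e. albums are genuinely nested within artists.
spotify %>% 
    distinct(album_id, artist) %>% 
    count(album_id) %>% 
    filter(n > 1)
```

An empty result supports treating albums as nested within artists.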

b)¶

artists

$\to$ Beyoncé, Camila Cabello, Camilo, ..

$\to$ Album 1 of Beyoncé, Album 2 of Beyoncé, .., Album 1 of Camila Cabello, ..

$\to$ Song 1 of Album 1 of Beyoncé, .., Song 10 of Album 1 of Beyoncé, .., Song 1 of Album 1 of Camila Cabello, ..

Thus a nested structure with artists $\to$ albums $\to$ songs.
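The nesting can also be tabulated directly; a sketch, assuming spotify_small as defined above:

```r
# Sketch: number of distinct albums and number of songs per artist
# in spotify_small, making the artists -> albums -> songs structure explicit.
spotify_small %>% 
    group_by(artist) %>% 
    summarize(n_albums = n_distinct(album_id), n_songs = n())
```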

c)¶

In [16]:
options(repr.plot.width=15, repr.plot.height=10)
ggplot( spotify_small, aes(x=valence, y=popularity, color=artist) ) + 
    geom_point() +
    geom_line( aes(group=album_id, color=artist) )

Maybe not the nicest plot: points represent individual songs, connected points indicate songs on the same album, and colors represent artists. The variability among the songs of a single album appears smaller than the overall variability across an artist's songs and albums.

Exercise 19.6¶

a)¶

$$Y_{ij} | \beta_{0j}, \beta_1, \sigma_y \sim N(\mu_{ij}, \sigma_y^2), \quad \text{with } \mu_{ij} = \beta_{0j} + \beta_1 X_{ij}$$

$$\beta_{0j}|\beta_0, \sigma_0 \sim N(\beta_0, \sigma_0^2)$$

$$\beta_{0c}\sim N(m_0, s_0^2)$$

$$\beta_{1}\sim N(0, s_1^2)$$

$$\sigma_y \sim \text{Exp}(l_y)$$

$$\sigma_0 \sim \text{Exp}(l_0)$$

where $X_{ij}$ and $Y_{ij}$ stand for the valence and popularity of song $i$ by artist $j$. $\beta_{0j}$ represents the artist-specific baseline popularity, $\sigma_y$ the variability in popularity among an artist's songs, and $\sigma_0$ the variability in baseline popularity between artists. $\beta_1$ is the global slope parameter for valence.

b)¶

In [17]:
spotify_model_1 <- stan_glmer(
  popularity ~ valence + (1 | artist), 
  data = spotify_small, family = gaussian,
  prior_intercept = normal(0, 2.5, autoscale = TRUE),
  prior = normal(0, 2.5, autoscale = TRUE), 
  prior_covariance = decov(reg = 1, conc = 1, shape = 1, scale = 1),
  chains = 4, iter = 5000*2, seed = 84735
)
In [18]:
options(repr.plot.width=15, repr.plot.height=5)
pp_check(spotify_model_1) + xlab("popularity")

The model does not represent the observed data well! Adding album-level information might reduce the inductive bias of the model.

c)¶

$$Y_{ijk} | \beta_{0}, b_{0j}, p_{0k}, \beta_1, \sigma_y \sim N(\mu_{ijk}, \sigma_y^2), \quad \text{with } \mu_{ijk} = \left(\beta_{0} + b_{0j} + p_{0k}\right) + \beta_1 X_{ijk}$$

$$b_{0j}|\sigma_b \sim N(0, \sigma_b^2)$$

$$p_{0k}|\sigma_p \sim N(0, \sigma_p^2)$$

$$\beta_{0c}\sim N(m_0, s_0^2)$$

$$\beta_{1}\sim N(0, s_1^2)$$

$$\sigma_y \sim \text{Exp}(l_y)$$

$$\sigma_b \sim \text{Exp}(l_b)$$

$$\sigma_p \sim \text{Exp}(l_p)$$

Here, $X_{ijk}$ and $Y_{ijk}$ stand for the valence and popularity of song $i$ on album $j$ by artist $k$. $b_{0j}$ captures the album-specific baseline offsets, $p_{0k}$ the artist-specific offsets, and $\sigma_y$ the variability in popularity among the songs of a particular album by a particular artist. $\beta_1$ is the global slope parameter for valence.

d)¶

In [19]:
spotify_model_2 <- stan_glmer(
  popularity ~ valence + (1 | album_id) + (1 | artist), 
  data = spotify_small, family = gaussian,
  prior_intercept = normal(0, 2.5, autoscale = TRUE),
  prior = normal(0, 2.5, autoscale = TRUE), 
  prior_covariance = decov(reg = 1, conc = 1, shape = 1, scale = 1),
  chains = 4, iter = 5000*2, seed = 84735
)
Warning message:
“There were 6 divergent transitions after warmup. See
https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
to find out why this is a problem and how to eliminate them.”
Warning message:
“Examine the pairs() plot to diagnose sampling problems
”

MCMC diagnostics:

In [20]:
options(repr.plot.width=15, repr.plot.height=15)
mcmc_trace(spotify_model_2, size = 0.1)
mcmc_dens_overlay(spotify_model_2)
mcmc_acf(spotify_model_2)
neff_ratio(spotify_model_2)
rhat(spotify_model_2)
(neff_ratio and rhat output, abridged)

neff_ratio: effective sample size ratios range from roughly 0.15 to 0.75 across all parameters, e.g. (Intercept) 0.35, valence 0.75, sigma 0.52, with the album- and artist-level intercepts in between.

rhat: all values lie between 1.0000 and about 1.0028.

Looks OK; we ignore the divergence warning here, but in practice one should tweak the MCMC settings (e.g. increase adapt_delta) to eliminate the divergent transitions.

Posterior predictive check:

In [21]:
pp_check(spotify_model_2) + xlab("popularity")

Wow, this is much better!

Exercise 19.7¶

a)¶

Fixed effects:

In [22]:
tidy(spotify_model_2, effects = "fixed")
A tibble: 2 × 3

| term | estimate | std.error |
|---|---|---|
| (Intercept) | 61.95850503 | 5.01287305 |
| valence | 0.06480214 | 0.03787398 |

Random effects:

In [23]:
tidy(spotify_model_2, effects = "ran_vals") %>% 
    filter( level %in% c("Kendrick_Lamar", "748dZDqSZy6aPXKcI9H80u") )
A tibble: 2 × 5

| level | group | term | estimate | std.error |
|---|---|---|---|---|
| 748dZDqSZy6aPXKcI9H80u | album_id | (Intercept) | 19.02559 | 7.044108 |
| Kendrick_Lamar | artist | (Intercept) | -16.42305 | 7.635712 |
  • Albums by artists not included in the spotify_small dataset: popularity = 62.0 + 0.065 * valence
  • A new album by Kendrick Lamar: popularity = 62.0 - 16.4 + 0.065 * valence = 45.5 + 0.065 * valence
  • Kendrick Lamar’s “good kid, m.A.A.d city” album: 62.0 - 16.4 + 19.0 + 0.065 * valence = 64.6 + 0.065 * valence
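The bullet-point arithmetic can be reproduced directly from the tidy() output; a sketch using the posterior medians shown above:

```r
# Sketch: rebuild the three intercepts from the posterior medians.
fixed <- tidy(spotify_model_2, effects = "fixed")
ran   <- tidy(spotify_model_2, effects = "ran_vals")

b0       <- fixed$estimate[fixed$term == "(Intercept)"]           # ~62.0
b_artist <- ran$estimate[ran$level == "Kendrick_Lamar"]           # ~-16.4
b_album  <- ran$estimate[ran$level == "748dZDqSZy6aPXKcI9H80u"]   # ~19.0

b0 + b_artist            # new Kendrick Lamar album: ~45.5
b0 + b_artist + b_album  # "good kid, m.A.A.d city": ~64.6
```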

b)¶

Kendrick Lamar's baseline popularity is below the average of our sample. However, his album "good kid, m.A.A.d city" was so popular that a song on it is expected to be more popular than the sample average.

c)¶

In [24]:
tidy(spotify_model_2, effects = "ran_vals") %>% 
    filter( group == "artist" ) %>% 
    arrange( desc(estimate) )
A tibble: 6 × 5

| level | group | term | estimate | std.error |
|---|---|---|---|---|
| Camilo | artist | (Intercept) | 6.298670 | 7.279473 |
| Camila_Cabello | artist | (Intercept) | 3.748713 | 5.899080 |
| J._Cole | artist | (Intercept) | 3.601396 | 6.845808 |
| Beyoncé | artist | (Intercept) | 2.820226 | 5.765683 |
| Frank_Ocean | artist | (Intercept) | 1.791417 | 5.608984 |
| Kendrick_Lamar | artist | (Intercept) | -16.423055 | 7.635712 |

Camilo.

d)¶

Most popular album:

In [25]:
tidy(spotify_model_2, effects = "ran_vals") %>% 
    filter( group == "album_id" ) %>% 
    arrange( desc(estimate) ) %>% 
    head()
A tibble: 6 × 5

| level | group | term | estimate | std.error |
|---|---|---|---|---|
| 4eLPsYPBmXABThSJ821sqY | album_id | (Intercept) | 30.25010 | 6.429640 |
| 3pLdWdkj83EYfDN6H2N8MR | album_id | (Intercept) | 26.84868 | 7.875206 |
| 748dZDqSZy6aPXKcI9H80u | album_id | (Intercept) | 19.02559 | 7.044108 |
| 6PBZN8cbwkqm1ERj2BGXJ1 | album_id | (Intercept) | 18.18493 | 7.323289 |
| 3XzSOIE6zGLliuqsVGLmUc | album_id | (Intercept) | 16.67567 | 7.243112 |
| 3Vsbl0diFGw8HNSjG8ue9m | album_id | (Intercept) | 16.50975 | 5.246488 |
In [26]:
spotify %>%
    filter( album_id == "4eLPsYPBmXABThSJ821sqY" ) %>% 
    select( artist, album_name ) %>% 
    first()
A tibble: 1 × 2

| artist | album_name |
|---|---|
| Kendrick Lamar | DAMN. |

Exercise 19.8¶

In [27]:
tidy(spotify_model_2, effects = "ran_pars") %>% 
    mutate( squared = estimate^2 ) %>% 
    mutate( proportion = squared/sum(squared) ) %>% 
    select( -squared )
A tibble: 3 × 4

| term | group | estimate | proportion |
|---|---|---|---|
| sd_(Intercept).album_id | album_id | 17.299246 | 0.65699327 |
| sd_(Intercept).artist | artist | 11.029960 | 0.26708805 |
| sd_Observation.Residual | Residual | 5.880592 | 0.07591869 |

$\sigma_y = 5.9, \sigma_b = 17.3, \sigma_p = 11.0$ ($\sigma_b$ represents variability between different albums of an artist and $\sigma_p$ variability between artists).

The largest proportion of variability in popularity (66%) comes from $\sigma_b$, i.e. variability between the albums of an artist, followed by $\sigma_p$, the variability between artists. Interestingly, the variability among the songs within an album is the smallest. Thus it appears that most artists do not produce albums of consistently similar popularity, but rather a mix of more and less popular albums. The variability between individual songs on an album is comparatively small (7.6%): if one song on an album is popular, the rest usually are too, and vice versa.

Exercise 19.9¶

In [28]:
bwc <- big_word_club %>% 
      mutate(treat = as.factor(treat))

a)¶

  • school-level predictors: treat, private_school, free_reduced_lunch
  • student-level predictors: age_months, esl_observed (English as a second language)

Interestingly, treat is a school-level predictor and the same for all students at a particular school.

In [29]:
bwc %>% 
    group_by( school_id ) %>% 
    summarize( 
        var_private_school = var(private_school),
        var_lunch = var(free_reduced_lunch),
        var_treat = var(as.numeric(treat)-1),
        var_age_months = var(age_months),
        var_esl_observed = var(esl_observed)
    ) %>% 
    mutate_if( is.factor, \(x) as.numeric(x)-1 ) %>% 
    summarize_all( max ) %>% 
    select(-school_id)
A tibble: 1 × 5

| var_private_school | var_lunch | var_treat | var_age_months | var_esl_observed |
|---|---|---|---|---|
| 0 | 0 | 0 | 174.8603 | 0.2614379 |

No variance among the school-level predictors, variance among the student-level predictors.

b)¶

How many tests per student?

In [30]:
bwc %>% 
    count( participant_id ) %>% 
    filter( n>1 )
A data.frame: 0 × 2 (no rows; columns: participant_id <int>, n <int>)

Only one test per student, so there is no nested grouping structure.

We use a normal hierarchical regression model with random intercepts; for simplicity, we do not include random slopes.

$$Y_{ij} | \beta_{0}, b_{0j}, \beta_1, \ldots, \beta_5, \sigma_y \sim N(\mu_{ij}, \sigma_y^2), \quad \text{with } \mu_{ij} = \left(\beta_{0} + b_{0j}\right) + \beta_1 U_{j1} + \beta_2 U_{j2} + \beta_3 U_{j3} + \beta_4 X_{ij4} + \beta_5 X_{ij5}$$

$$b_{0j}|\sigma_b \sim N(0, \sigma_b^2)$$

$$\beta_{0c}, \beta_{1}, \beta_{2}, \beta_{3}, \beta_{4}, \beta_{5}\sim N(0, s_0^2)$$

$$\sigma_b \sim \text{Exp}(l_b)$$

$$\sigma_y \sim \text{Exp}(l_y)$$

Here, $Y_{ij}$ stands for the percent change in score for student $i$ at school $j$. $b_{0j}$ represents the school-specific baseline percent changes, and $\sigma_y$ the variability among the students of an individual school. $\beta_1$, $\beta_2$ and $\beta_3$ are the global slope parameters for the school-level predictors treat, private_school and free_reduced_lunch (denoted $U_{j1}$, $U_{j2}$ and $U_{j3}$). $\beta_4$ and $\beta_5$ are the slopes for the student-level predictors age_months and esl_observed (denoted $X_{ij4}$ and $X_{ij5}$).

c)¶

In [31]:
bwc_model <- stan_glmer(
  score_pct_change ~ age_months + esl_observed + treat + private_school + 
                     free_reduced_lunch + (1 | school_id), 
  data = bwc, family = gaussian,
  prior_intercept = normal(0, 2.5, autoscale = TRUE),
  prior = normal(0, 2.5, autoscale = TRUE), 
  prior_covariance = decov(reg = 1, conc = 1, shape = 1, scale = 1),
  chains = 4, iter = 5000*2, seed = 84735
)

Diagnostics:

In [32]:
options(repr.plot.width=15, repr.plot.height=15)
mcmc_trace(bwc_model, size = 0.1)
mcmc_dens_overlay(bwc_model)
mcmc_acf(bwc_model)
neff_ratio(bwc_model)
rhat(bwc_model)
(neff_ratio and rhat output, abridged)

neff_ratio: ratios range from roughly 0.31 to 1.67, e.g. (Intercept) 1.46, age_months 1.51, sigma 1.50; the smallest is Sigma[school_id:(Intercept),(Intercept)] at 0.31.

rhat: all values are essentially 1.000 (maximum about 1.0007).

Looks good!

Posterior predictive check for entire model:

In [33]:
options(repr.plot.width=15, repr.plot.height=5)
pp_check(bwc_model) + xlab("score percent change")

Reasonable, although the normal does not capture the data distribution perfectly.

Exercise 19.10¶

a)¶

In [34]:
tidy(bwc_model, effects = "fixed", conf.int = TRUE, conf.level = 0.80)
A tibble: 6 × 5

| term | estimate | std.error | conf.low | conf.high |
|---|---|---|---|---|
| (Intercept) | 17.61253771 | 6.46763770 | 9.43693441 | 25.81506041 |
| age_months | -0.17189181 | 0.08883075 | -0.28411372 | -0.05867890 |
| esl_observed | 0.73754942 | 2.80741039 | -2.83367186 | 4.34695532 |
| treat1 | -0.88277744 | 1.63783580 | -2.97946784 | 1.23215486 |
| private_school | 0.89210981 | 3.07991282 | -3.15180117 | 4.83356378 |
| free_reduced_lunch | 0.01817009 | 0.03533995 | -0.02757163 | 0.06304653 |

Only age_months has an 80% credible interval that excludes zero, though with considerable uncertainty. Each additional month of age reduces score_pct_change by about 0.17 percentage points.

b)¶

In [35]:
tidy(bwc_model, effects = "ran_pars") %>% 
    mutate( squared=estimate^2, proportion=squared/sum(squared) ) %>% 
    select( -squared )
A tibble: 2 × 4

| term | group | estimate | proportion |
|---|---|---|---|
| sd_(Intercept).school_id | school_id | 2.509992 | 0.02033181 |
| sd_Observation.Residual | Residual | 17.423032 | 0.97966819 |

$\sigma_y=17.4$, $\sigma_b=2.5$. The variability between the students of a school clearly dominates the overall variability.

Exercise 19.11¶

In [36]:
school_level_effects <- tidy(bwc_model, effects = "ran_vals", conf.int = TRUE, conf.level = 0.80)
head( school_level_effects )
A tibble: 6 × 7

| level | group | term | estimate | std.error | conf.low | conf.high |
|---|---|---|---|---|---|---|
| 1 | school_id | (Intercept) | -0.2128394 | 1.595665 | -3.2222516 | 1.9455620 |
| 2 | school_id | (Intercept) | 0.9269347 | 1.816862 | -0.8907846 | 4.6999950 |
| 3 | school_id | (Intercept) | 0.8832702 | 1.820799 | -0.9400862 | 4.6748250 |
| 4 | school_id | (Intercept) | 0.1742617 | 1.596676 | -2.0269883 | 3.0610136 |
| 5 | school_id | (Intercept) | 0.3163362 | 1.607116 | -1.6833285 | 3.4417448 |
| 6 | school_id | (Intercept) | -1.2107841 | 2.042551 | -5.5164243 | 0.6814634 |

a)¶

In [37]:
school_level_effects %>% 
    filter( level == "30" )
A tibble: 1 × 7

| level | group | term | estimate | std.error | conf.low | conf.high |
|---|---|---|---|---|---|---|
| 30 | school_id | (Intercept) | -1.591776 | 2.353159 | -6.042635 | 0.3968267 |

Posterior median model: score_pct_change = 17.6 - 1.6 - 0.17 age_months = 16.0 - 0.17 age_months (reporting only significant coefficients)
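As a cross-check, the same intercept can be assembled from the model output; a sketch reusing school_level_effects from above:

```r
# Sketch: school-30 intercept = global intercept + school-30 offset.
fx  <- tidy(bwc_model, effects = "fixed")
b0  <- fx$estimate[fx$term == "(Intercept)"]                              # ~17.6
b30 <- school_level_effects$estimate[school_level_effects$level == "30"]  # ~-1.6
b0 + b30   # ~16.0
```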

b)¶

In [38]:
school_level_effects %>% 
    filter( level == "47" )
A tibble: 1 × 7

| level | group | term | estimate | std.error | conf.low | conf.high |
|---|---|---|---|---|---|---|
| 47 | school_id | (Intercept) | 1.354891 | 2.144308 | -0.5696326 | 5.652695 |

Posterior median model: score_pct_change = 17.6 + 1.4 - 0.17 age_months = 19.0 - 0.17 age_months (reporting only significant coefficients)

c)¶

Posterior median model: score_pct_change = 17.6 - 0.17 age_months + 0.018 * 95 = 19.3 - 0.17 age_months

(using the non-significant contribution of free_reduced_lunch)

d)¶

Posterior median model: score_pct_change = 17.6 - 0.17 age_months + 0.018 * 10 = 17.8 - 0.17 age_months

(using the non-significant contribution of free_reduced_lunch)
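Parts c) and d) differ only in the value of free_reduced_lunch, so both intercepts can be computed in one go; a sketch from the posterior medians:

```r
# Sketch: intercepts for an average new school at two values of
# free_reduced_lunch (95 for part c, 10 for part d).
fx    <- tidy(bwc_model, effects = "fixed")
b0    <- fx$estimate[fx$term == "(Intercept)"]
b_frl <- fx$estimate[fx$term == "free_reduced_lunch"]
b0 + b_frl * 95   # ~19.3
b0 + b_frl * 10   # ~17.8
```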

Exercise 19.12¶

a)¶

None of the 80% credible intervals for the school-level coefficients exclude zero. With more data, private_school, and perhaps also free_reduced_lunch, might turn out to matter, but the current estimates do not clearly show this.

b)¶

The younger the student, the greater the vocabulary improvement (the only coefficient whose 80% credible interval excludes zero, though with considerable uncertainty). With more data, being an ESL student might also show a positive effect (more room for improvement / perhaps a steeper learning curve).